Session 3: Scraping Static Web Pages
2023-07-26
In this session, we trap some docile data that wants to be found.
Inspect tool
Note: The Inspect tool might not be available in all browsers; use a Chromium-based browser or Firefox.
rvest to scrape
library(rvest)
library(tidyverse)
# 1. Request & collect raw html
html <- read_html("https://en.wikipedia.org/wiki/World_Happiness_Report")
# 2. Parse
happy_table <- html |>
html_elements(".wikitable") |> # select the right element
html_table() |> # special function for tables
pluck(3) # select the third table
# 3. No wrangling necessary
happy_table
# A tibble: 153 × 9
`Overall rank` `Country or region` Score `GDP per capita` `Social support`
<int> <chr> <dbl> <dbl> <dbl>
1 1 Finland 7.81 1.28 1.5
2 2 Denmark 7.65 1.33 1.50
3 3 Switzerland 7.56 1.39 1.47
4 4 Iceland 7.50 1.33 1.55
5 5 Norway 7.49 1.42 1.50
6 6 Netherlands 7.45 1.34 1.46
7 7 Sweden 7.35 1.32 1.43
8 8 New Zealand 7.3 1.24 1.49
9 9 Austria 7.29 1.32 1.44
10 10 Luxembourg 7.24 1.54 1.39
# ℹ 143 more rows
# ℹ 4 more variables: `Healthy life expectancy` <dbl>,
# `Freedom to make life choices` <dbl>, Generosity <dbl>,
# `Perceptions of corruption` <dbl>
rvest to scrape
# 1. Request & collect raw html
html <- read_html("https://en.wikipedia.org/w/index.php?title=List_of_prime_ministers_of_the_United_Kingdom&oldid=1166167337") # I'm using an older version of the page since it was recently changed
# 2. Parse
pm_table <- html |>
html_element(".wikitable:contains('List of prime ministers')") |>
html_table() |>
as_tibble(.name_repair = "unique") |>
filter(!duplicated(`Prime ministerOffice(Lifespan)`))
# 3. No wrangling necessary
pm_table
# A tibble: 75 × 11
Portrait...1 Portrait...2 Prime ministerOffice(Lifespa…¹ `Term of office...4`
<chr> <chr> <chr> <chr>
1 "Portrait" "Portrait" Prime ministerOffice(Lifespan) start
2 "" "" Robert Walpole[27]MP for King… 3 April1721
3 "" "" Spencer Compton[28]1st Earl o… 16 February1742
4 "" "" Henry Pelham[29]MP for Sussex… 27 August1743
5 "" "" Thomas Pelham-Holles[30]1st D… 16 March1754
6 "" "" William Cavendish[31]4th Duke… 16 November1756
7 "" "" Thomas Pelham-Holles[32]1st D… 29 June1757
8 "" "" John Stuart[33]3rd Earl of Bu… 26 May1762
9 "" "" George Grenville[34]MP for Bu… 16 April1763
10 "" "" Charles Watson-Wentworth[35]2… 13 July1765
# ℹ 65 more rows
# ℹ abbreviated name: ¹`Prime ministerOffice(Lifespan)`
# ℹ 7 more variables: `Term of office...5` <chr>, `Term of office...6` <chr>,
# `Mandate[a]` <chr>, `Ministerial offices held as prime minister` <chr>,
# Party <chr>, Government <chr>, MonarchReign <chr>
<td rowspan="4">
<span class="anchor" id="18th_century"></span>
<b>
<a href="/wiki/Robert_Walpole" title="Robert Walpole">Robert Walpole</a>
</b>
<sup id="cite_ref-FOOTNOTEEccleshallWalker20021,_5EnglefieldSeatonWhite19951–5PrydeGreenwayPorterRoy199645–46_28-0" class="reference">
<a href="#cite_note-FOOTNOTEEccleshallWalker20021,_5EnglefieldSeatonWhite19951–5PrydeGreenwayPorterRoy199645–46-28">[27]</a>
</sup>
<br>
<span style="font-size:85%;">MP for <a href="/wiki/King%27s_Lynn_(UK_Parliament_constituency)" title="King's Lynn (UK Parliament constituency)">King's Lynn</a>
<br>(1676–1745)
</span>
</td>

links <- html |>
html_elements(".wikitable:contains('List of prime ministers') b a") |>
html_attr("href")
title <- html |>
html_elements(".wikitable:contains('List of prime ministers') b a") |>
html_text()
tibble(name = title, link = links)# A tibble: 90 × 2
name link
<chr> <chr>
1 Robert Walpole /wiki/Robert_Walpole
2 George I /wiki/George_I_of_Great_Britain
3 George II /wiki/George_II_of_Great_Britain
4 Spencer Compton /wiki/Spencer_Compton,_1st_Earl_of_Wilmington
5 Henry Pelham /wiki/Henry_Pelham
6 Thomas Pelham-Holles /wiki/Thomas_Pelham-Holles,_1st_Duke_of_Newcastle
7 William Cavendish /wiki/William_Cavendish,_4th_Duke_of_Devonshire
8 Thomas Pelham-Holles /wiki/Thomas_Pelham-Holles,_1st_Duke_of_Newcastle
9 George III /wiki/George_III
10 John Stuart /wiki/John_Stuart,_3rd_Earl_of_Bute
# ℹ 80 more rows
Note: these are relative links that need to be combined with https://en.wikipedia.org/ to work
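One way to turn them into working URLs is url_absolute() from xml2 (the package rvest builds on); a minimal sketch:

```r
library(xml2)

# Resolve a relative link against the site's base URL
url_absolute("/wiki/Robert_Walpole", "https://en.wikipedia.org/")
#> [1] "https://en.wikipedia.org/wiki/Robert_Walpole"
```

The function is vectorised, so it can be applied to the whole links vector in one call.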
Exercises:
1. rvest has two functions: html_text() and html_text2(). Explain the difference. You can test your explanation with the example HTML below.
2. How can you change the links object so that it contains actual URLs?
3. How can you add the links to pm_table to keep everything together?

I started the code below, now it’s your turn to finish it:
# 1. Request & collect raw html
html <- read_html("https://www.ic2s2.org/program.html")
sessions <- html |>
html_elements(".nav_list")
# 2. Parse
talks <- sessions |>
html_elements("li")
talks_titles <- talks |>
html_elements("") |>
html_text()
talks_speaker <- talks |>
html_elements("") |>
html_text()
talks_authors <- talks |>
html_elements("") |>
html_text()

# 1. Request & collect raw html
html <- read_html("https://www.ic2s2.org/program.html")
sessions <- html |>
html_elements(".nav_list")
# 2. Parse
talks <- sessions |>
html_elements("li")
talks_titles <- talks |>
html_elements("b") |>
html_text()
talks_speaker <- talks |>
html_elements("u") |>
html_text()
talks_authors <- talks |>
html_elements("i") |>
html_text()
head(talks_titles)
[1] "9:45 - The dynamics of writing style in scientific collaborations [705]"
[2] "9:51 - Performative Capitalism in Silicon Valley: Examining Performative and Substantive Political Behaviors of U.S. Technology Companies [581]"
[3] "9:57 - Spillover of Antisocial Behavior from Fringe Platforms: The Unintended Consequences of Community Banning [537]"
[4] "10:03 - The Microtargeting Manipulation Machine [383]"
[5] "10:09 - Sentence-Level Explanations of Word Embeddings Associations [518]"
[6] "10:15 - Mapping the Effect of Altruistic Punishment on Cooperation [730]"
head(talks_speaker)
[1] "Jonas L Juul" "Chloe Ahn" "Giuseppe Russo" "Almog Simchon"
[5] "Arianna Pera" "Mohammed Alsobay"
head(talks_authors)
[1] "Jonas L Juul, Jon Kleinberg"
[2] "Chloe Ahn, Xinyi Wang"
[3] "Giuseppe Russo, luca verginer, Manoel Horta Ribeiro, Giona Casiraghi"
[4] "Almog Simchon, Adam Sutton, Matthew Edwards, Stephan Lewandowsky"
[5] "Arianna Pera, Manuel Vimercati, Matteo Palmonari"
[6] "Mohammed Alsobay, Abdullah Almaatouq, David G. Rand, Duncan J. Watts"
We can now try to wrangle the data into a suitable format:
Error in `tibble()`:
! Tibble columns must have compatible sizes.
• Size 606: Existing data.
• Size 593: Column at position 2.
ℹ Only values of size one are recycled.
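What happened: some <li> elements lack one of the tags, so the extracted vectors have different lengths and tibble() cannot line them up. A self-contained illustration of the same error:

```r
library(tibble)

titles <- c("Talk A", "Talk B", "Talk C")
speakers <- c("Speaker A", "Speaker B") # one talk has no speaker element

# tibble() only recycles length-one values, so this fails
try(tibble(title = titles, speaker = speakers))
```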
If several elements are children of an element that represents our “case”/data row, we can loop over the identified parent elements. For this, let’s write a function that parses one talk:
talks <- sessions |>
html_elements("li")
parse_talks <- function(x) {
title <- x |>
html_elements("b") |>
html_text()
speaker <- x |>
html_elements("u") |>
html_text()
author <- x |>
html_elements("i") |>
html_text()
tibble(title, speaker, author)
}
parse_talks(talks[[1]])
# A tibble: 1 × 3
title speaker author
<chr> <chr> <chr>
1 9:45 - The dynamics of writing style in scientific collaborati… Jonas … Jonas…
Now we can use the function to loop over all talks:
talks_data <- map(talks, parse_talks) |>
bind_rows() |>
separate(col = title, into = c("time", "title"), sep = " - ")
talks_data
# A tibble: 593 × 4
time title speaker author
<chr> <chr> <chr> <chr>
1 9:45 The dynamics of writing style in scientific collaborati… Jonas … Jonas…
2 9:51 Performative Capitalism in Silicon Valley: Examining Pe… Chloe … Chloe…
3 9:57 Spillover of Antisocial Behavior from Fringe Platforms:… Giusep… Giuse…
4 10:03 The Microtargeting Manipulation Machine [383] Almog … Almog…
5 10:09 Sentence-Level Explanations of Word Embeddings Associat… Ariann… Arian…
6 10:15 Mapping the Effect of Altruistic Punishment on Cooperat… Mohamm… Moham…
7 10:21 Social Contagion in Science [725] Satyak… Sara …
8 10:27 Big Moves: Tactics for early career success [710] Isabel… Isabe…
9 9:45 The motherhood penalty in mobility [418] Silvia… Silvi…
10 9:51 Momentary Symbols: Tracing the Visual Expressions of Pr… Emil B… Emil …
# ℹ 583 more rows
titles <- html |>
html_elements(".wrapper.style2") |>
html_elements(":not(header)>h2")
length(titles)
[1] 48
talks_data <- map2(titles, sessions, function(x, y) {
chair <- x |>
html_element("i") |>
html_text()
session_title <- x |>
html_text() |>
str_remove(chair)
y |>
html_elements("li") |>
map(parse_talks) |>
bind_rows() |>
mutate(session = session_title, chair = chair, .before = 1)
}) |>
bind_rows()
talks_data
# A tibble: 284 × 5
session chair title speaker author
<chr> <chr> <chr> <chr> <chr>
1 "Session 1A: \"Polarization I\" - July 18, 11:00,… Chai… 11:0… Marile… Maril…
2 "Session 1A: \"Polarization I\" - July 18, 11:00,… Chai… 11:1… Tianch… Tianc…
3 "Session 1A: \"Polarization I\" - July 18, 11:00,… Chai… 11:3… Nikola… Nikol…
4 "Session 1A: \"Polarization I\" - July 18, 11:00,… Chai… 11:4… Gianma… Corra…
5 "Session 1A: \"Polarization I\" - July 18, 11:00,… Chai… 12:0… Homa H… Homa …
6 "Session 1A: \"Polarization I\" - July 18, 11:00,… Chai… 12:1… Li Zha… Li Zh…
7 "Session 1B: \"Social networks I\" - July 18, 11:… Chai… 11:0… Melani… Melan…
8 "Session 1B: \"Social networks I\" - July 18, 11:… Chai… 11:1… Takahi… Takah…
9 "Session 1B: \"Social networks I\" - July 18, 11:… Chai… 11:3… Edwin … Dinge…
10 "Session 1B: \"Social networks I\" - July 18, 11:… Chai… 11:4… Veniam… Manoe…
# ℹ 274 more rows
Why:
library(jsonlite)
json_string <- list(x = 1:10, y = list(z = 1:10, a = LETTERS[1:10])) |>
toJSON()
json_string
{"x":[1,2,3,4,5,6,7,8,9,10],"y":{"z":[1,2,3,4,5,6,7,8,9,10],"a":["A","B","C","D","E","F","G","H","I","J"]}}
fromJSON(json_string)
$x
[1] 1 2 3 4 5 6 7 8 9 10
$y
$y$z
[1] 1 2 3 4 5 6 7 8 9 10
$y$a
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
We use the standard selector to get the script inside the scheduleGrid_Container:
Don't try to print the entire string, since it is quite large and might crash RStudio!
nchar(json_string)
[1] 1985406
[1] "DevExpress.utils.readyCallbacks.add((function($){$(\"#AcademicProgramme_ScheduleGrid\").dxDataGrid({\"dataSource\":{\"store\":new DevExpress.data.ArrayStore({\"data\":[{\"ID\":12761,\"EventID\":214,\"StartDate\":new Date(2023, 8, 4),\"EndDate\":new Date(2023, 8, 4),"
Note: In newer versions of R, read_html seems to have a bug that causes problems with special encodings, which is why I load the data with the base function readLines().
Let’s give it a try:
Error: lexical error: invalid char in json text.
DevExpress.utils.readyCallbacks
(right here) ------^
The problem is that this is not just the json itself, but also some Javascript to sort and render it:
[1] "DevExpress.utils.readyCallbacks.add((function($){$(\"#AcademicProgramme_ScheduleGrid\").dxDataGrid({\"dataSource\":{\"store\":new DevExpress.data.ArrayStore({\"data\":[{\"ID\":12761,\"EventID\":214,\"StartDate\":new Date(2023, 8, 4),\"EndDate\":new Date(2023, 8, 4),"
If you have a little experience with json, you can spot here that the actual data starts only here:
data.ArrayStore({\"data\"
(right here) ---^
json_clean <- json_string |>
str_remove_all(fixed("DevExpress.utils.readyCallbacks.add((function($){$(\"#AcademicProgramme_ScheduleGrid\").dxDataGrid({\"dataSource\":{\"store\":new DevExpress.data.ArrayStore("))
fromJSON(json_clean)
Error: lexical error: invalid char in json text.
61,"EventID":214,"StartDate":new Date(2023, 8, 4),"EndDate":
(right here) ------^
It looks like there is some more Javascript code in the data. We can convert the date function into proper json data with a little regular expression:
str_replace_all('"StartDate":new Date(2023, 8, 4)', "new Date\\((\\d+), (\\d+), (\\d+)\\)", "\"\\1-\\2-\\3\"")
[1] "\"StartDate\":\"2023-8-4\""
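One caveat with this quick fix: JavaScript's Date() months are zero-based, so new Date(2023, 8, 4) actually denotes 4 September 2023, and the simple replacement is off by one month. If the calendar date matters, a function replacement (supported by stringr's str_replace_all) can bump the month; a sketch with a hypothetical helper:

```r
library(stringr)

# Replace new Date(y, m, d) with a quoted ISO date, adding 1 to the
# zero-based JavaScript month
fix_js_dates <- function(x) {
  str_replace_all(x, "new Date\\((\\d+), (\\d+), (\\d+)\\)", function(m) {
    parts <- str_match(m, "new Date\\((\\d+), (\\d+), (\\d+)\\)")
    sprintf('"%s-%02d-%02d"',
            parts[, 2],
            as.integer(parts[, 3]) + 1L,
            as.integer(parts[, 4]))
  })
}

fix_js_dates('"StartDate":new Date(2023, 8, 4)')
#> [1] "\"StartDate\":\"2023-09-04\""
```

For the wrangling in this session, the simpler "\\1-\\2-\\3" version is kept.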
Perfect, let’s try this again:
json_clean <- json_string |>
str_remove_all(fixed("DevExpress.utils.readyCallbacks.add((function($){$(\"#AcademicProgramme_ScheduleGrid\").dxDataGrid({\"dataSource\":{\"store\":new DevExpress.data.ArrayStore(")) |>
str_replace_all("new Date\\((\\d+), (\\d+), (\\d+)\\)", "\"\\1-\\2-\\3\"")
fromJSON(json_clean)
Error: parse error: trailing garbage
03e","IsInItinerary":false}]})},"showBorders":true,"showColu
(right here) ------^
Blimey! More garbage :(
It seems there are more problems at the end of the json string, so let's have a look:
len <- nchar(json_string)
substr(json_string, len - 1100, len) # It took a few guesses to circle in on the 1100
[1] "\\u003e\\r\\n\\u003cbr/\\u003e\\r\\n\\u003ca href=\\u0022/profile/ElisaBellè\\u0022 target=\\u0022_blank\\u0022\\u003eElisa\\u0026nbsp;Bellè\\u003c/a\\u003e\\r\\n\\u003cbr/\\u003e\",\"IsInItinerary\":false}]})},\"showBorders\":true,\"showColumnLines\":false,\"showRowLines\":true,\"paging\":{\"pageSize\":20},\"columnHidingEnabled\":true,\"columnAutoWidth\":true,\"wordWrapEnabled\":true,\"searchPanel\":{\"visible\":true,\"searchVisibleColumnsOnly\":true},\"columns\":[{\"dataField\":\"StartDate\",\"cssClass\":\"align-top\",\"visible\":false},{\"dataField\":\"EndDate\",\"cssClass\":\"align-top\",\"visible\":false},{\"dataField\":\"Activity\",\"cssClass\":\"align-top\",\"encodeHtml\":false,\"allowHiding\":false},{\"dataField\":\"Papers\",\"cssClass\":\"align-top\",\"encodeHtml\":false,\"allowHiding\":true,\"hidingPriority\":1},{\"dataField\":\"People\",\"cssClass\":\"align-top\",\"encodeHtml\":false,\"allowHiding\":true,\"hidingPriority\":0},{\"name\":\"ItineraryButton\",\"cssClass\":\"align-top\",\"caption\":\" \",\"encodeHtml\":false,\"allowHiding\":true,\"hidingPriority\":2,\"calculateCellValue\":function(data) { return scheduleGrid.formatItineraryButton(data); },\"alignment\":\"center\"}]});}).bind(this, jQuery));"
It seems Elisa\\u0026nbsp;Bellè is still part of the data, and R guessed pretty well which part is the garbage. Let's remove the trailing Javascript bits.
Perfect, let’s try this again:
json_clean <- json_string |>
str_remove_all(fixed("DevExpress.utils.readyCallbacks.add((function($){$(\"#AcademicProgramme_ScheduleGrid\").dxDataGrid({\"dataSource\":{\"store\":new DevExpress.data.ArrayStore(")) |>
str_remove_all(fixed(")},\"showBorders\":true,\"showColumnLines\":false,\"showRowLines\":true,\"paging\":{\"pageSize\":20},\"columnHidingEnabled\":true,\"columnAutoWidth\":true,\"wordWrapEnabled\":true,\"searchPanel\":{\"visible\":true,\"searchVisibleColumnsOnly\":true},\"columns\":[{\"dataField\":\"StartDate\",\"cssClass\":\"align-top\",\"visible\":false},{\"dataField\":\"EndDate\",\"cssClass\":\"align-top\",\"visible\":false},{\"dataField\":\"Activity\",\"cssClass\":\"align-top\",\"encodeHtml\":false,\"allowHiding\":false},{\"dataField\":\"Papers\",\"cssClass\":\"align-top\",\"encodeHtml\":false,\"allowHiding\":true,\"hidingPriority\":1},{\"dataField\":\"People\",\"cssClass\":\"align-top\",\"encodeHtml\":false,\"allowHiding\":true,\"hidingPriority\":0},{\"name\":\"ItineraryButton\",\"cssClass\":\"align-top\",\"caption\":\" \",\"encodeHtml\":false,\"allowHiding\":true,\"hidingPriority\":2,\"calculateCellValue\":function(data) { return scheduleGrid.formatItineraryButton(data); },\"alignment\":\"center\"}]});}).bind(this, jQuery));")) |>
str_replace_all("new Date\\((\\d+), (\\d+), (\\d+)\\)", "\"\\1-\\2-\\3\"")
json_parsed <- fromJSON(json_clean)

Smashing! R converted the data successfully!
Let’s have a look at the data now:
List of 1
$ data:'data.frame': 607 obs. of 8 variables:
..$ ID : int [1:607] 12761 12045 12572 12400 12085 12425 12076 12411 12099 12058 ...
..$ EventID : int [1:607] 214 214 214 214 214 214 214 214 214 214 ...
..$ StartDate : chr [1:607] "2023-8-4" "2023-8-4" "2023-8-4" "2023-8-4" ...
..$ EndDate : chr [1:607] "2023-8-4" "2023-8-4" "2023-8-4" "2023-8-4" ...
..$ Activity : chr [1:607] "<i class=\"fa-solid fa-buildings\" title=\"In person icon\"></i>\r\nExecutive Committee Meeting (For Executive "| __truncated__ "<i class=\"fa-solid fa-buildings\" title=\"In person icon\"></i>\r\n<a href=\"/Events/Event/PanelDetails/12825\"| __truncated__ "<i class=\"fa-solid fa-buildings\" title=\"In person icon\"></i>\r\n<a href=\"/Events/Event/PanelDetails/12587\"| __truncated__ "<i class=\"fa-solid fa-buildings\" title=\"In person icon\"></i>\r\n<a href=\"/Events/Event/PanelDetails/12861\"| __truncated__ ...
..$ Papers : chr [1:607] "" "<a href=\"/Events/Event/PaperDetails/70630\"target=\"_blank\">Critique of Critical Policy Discourse Analysis</a"| __truncated__ "<a href=\"/Events/Event/PaperDetails/67670\"target=\"_blank\">Elites in times of epochal change. Empirical evid"| __truncated__ "<a href=\"/Events/Event/PaperDetails/67347\"target=\"_blank\">Does Political Polarization Affect Migrant Integr"| __truncated__ ...
..$ People : chr [1:607] "" "Chair: \r\n<a href=\"/profile/MichaelFarrelly\" target=\"_blank\">Michael Farrelly</a>\r\n<br/>\r\nCo-Chai"| __truncated__ "Chair: \r\n<a href=\"/profile/UrsulaHoffmann-Lange\" target=\"_blank\">Ursula Hoffmann-Lange</a>\r\n<br/>\"| __truncated__ "Chair: \r\n<a href=\"/profile/MaikHerold\" target=\"_blank\">Maik Herold</a>\r\n<br/>\r\nCo-Chair: \r\n<a "| __truncated__ ...
..$ IsInItinerary: logi [1:607] FALSE FALSE FALSE FALSE FALSE FALSE ...
ecpr_data is now a list that contains only one data.frame. The data.frame seems to have a mix of clean data and HTML code. As a first step, we pull the data out and save it in a new object:
[1] "<a href=\"/Events/Event/PaperDetails/70630\"target=\"_blank\">Critique of Critical Policy Discourse Analysis</a>\r\n<br/>\r\nAuthor: <a href=\"/profile/MichaelKranert\" target=\"_blank\">Michael Kranert</a>\r\n<br/>\r\n<br/>\r\n<a href=\"/Events/Event/PaperDetails/70631\"target=\"_blank\">Critique of Critical Policy Discourse Analysis</a>\r\n<br/>\r\nAuthor: <a href=\"/profile/DesGasper\" target=\"_blank\">Des Gasper</a>\r\n<br/>\r\n<br/>\r\n<a href=\"/Events/Event/PaperDetails/70632\"target=\"_blank\">An Editor Responds to Critique of Critical Policy Discourse Analysis</a>\r\n<br/>\r\nAuthor: <a href=\"/profile/MichaelFarrelly\" target=\"_blank\">Michael Farrelly</a>"
Now since we are used to looking at HTML not in its raw form, but in a browser, we can define a small function to write the HTML code to a file and ask the browser to render it:
html <- read_html(ecpr_data_df$Papers[2])
paper_title <- html |>
html_elements("[href*='PaperDetails']") |>
html_text2()
paper_title
[1] "Critique of Critical Policy Discourse Analysis"
[2] "Critique of Critical Policy Discourse Analysis"
[3] "An Editor Responds to Critique of Critical Policy Discourse Analysis"
[1] "Michael Kranert" "Des Gasper" "Michael Farrelly"
extract_papers <- function(html) {
html <- read_html(html)
paper_title <- html |>
html_elements("[href*='PaperDetails']") |>
html_text2()
authors <- html |>
html_elements("[href*='profile']") |>
html_text2()
author_urls <- html |>
html_elements("[href*='profile']") |>
html_attr("href")
tibble(paper_title, authors, author_urls)
}

# A tibble: 3 × 3
paper_title authors author_urls
<chr> <chr> <chr>
1 Critique of Critical Policy Discourse Analysis Michae… /profile/M…
2 Critique of Critical Policy Discourse Analysis Des Ga… /profile/D…
3 An Editor Responds to Critique of Critical Policy Discour… Michae… /profile/M…
ecpr_data <- ecpr_data_df |>
filter(Papers != "") |> # remove empty lines
mutate(papers = map(Papers, extract_papers)) # loop over all sessions
Error in `mutate()`:
ℹ In argument: `papers = map(Papers, extract_papers)`.
Caused by error in `map()`:
ℹ In index: 2.
Caused by error in `tibble()`:
! Tibble columns must have compatible sizes.
• Size 5: Existing data.
• Size 6: Column at position 2.
ℹ Only values of size one are recycled.
Another error :( let’s investigate!
html <- ecpr_data_df |>
filter(Papers != "") |>
slice(2) |>
pull(Papers) |>
read_html()
paper_title <- html |>
html_elements("[href*='PaperDetails']") |>
html_text2()
authors <- html |>
html_elements("[href*='profile']") |>
html_text2()
author_urls <- html |>
html_elements("[href*='profile']") |>
html_attr("href")
paper_title
[1] "Elites in times of epochal change. Empirical evidence and theoretical insights in an intertemporal perspective"
[2] "Determinants of Cooperation Potentials in the German Bundestag"
[3] "From elite consensus to asymmetric polarisation. Lessons learnt from the Hungarian case"
[4] "Band of Brothers: Projection of Power Centralization in Times of Crisis in Central European countries"
[5] "The Rule of Populist Nationalists and Elite Decay in Poland and Hungary"
[1] "Heinrich Best" "Ursula Hoffmann-Lange" "Gabriella Ilonszki"
[4] "Gyorgy Lengyel" "David Broul" "Jan Pakulski"
[1] "/profile/HeinrichBest1" "/profile/UrsulaHoffmann-Lange"
[3] "/profile/GabriellaIlonszki2" "/profile/GyorgyLengyel"
[5] "/profile/DavidBroul" "/profile/JanPakulski"
It makes sense that some papers have several authors, so let's take another look at the HTML here:
<a href="/Events/Event/PaperDetails/67670"target="_blank">Elites in times of epochal change. Empirical evidence and theoretical insights in an intertemporal perspective</a>
<br/>
Author: <a href="/profile/HeinrichBest1" target="_blank">Heinrich Best</a>
<br/>
<br/>
<a href="/Events/Event/PaperDetails/67671"target="_blank">Determinants of Cooperation Potentials in the German Bundestag</a>
<br/>
Author: <a href="/profile/UrsulaHoffmann-Lange" target="_blank">Ursula Hoffmann-Lange</a>
<br/>
<br/>
<a href="/Events/Event/PaperDetails/67672"target="_blank">From elite consensus to asymmetric polarisation. Lessons learnt from the Hungarian case</a>
<br/>
Authors: <a href="/profile/GabriellaIlonszki2" target="_blank">Gabriella Ilonszki</a>, <a href="/profile/GyorgyLengyel" target="_blank">Gyorgy Lengyel</a>
<br/>
<br/>
<a href="/Events/Event/PaperDetails/70860"target="_blank">Band of Brothers: Projection of Power Centralization in Times of Crisis in Central European countries</a>
<br/>
Author: <a href="/profile/DavidBroul" target="_blank">David Broul</a>
<br/>
<br/>
<a href="/Events/Event/PaperDetails/71441"target="_blank">The Rule of Populist Nationalists and Elite Decay in Poland and Hungary</a>
<br/>
Author: <a href="/profile/JanPakulski" target="_blank">Jan Pakulski</a>
Annoyingly, there does not seem to be a good way to combine paper titles and authors from this HTML structure :(
On the other hand, html code is just text, so maybe we can manipulate the strings a little?
author_lines <- html |>
as.character() |>
strsplit(split = "\n") |>
pluck(1) |>
str_subset("^Author")
author_lines[1] "Author: <a href=\"/profile/HeinrichBest1\" target=\"_blank\">Heinrich Best</a>\r"
[2] "Author: <a href=\"/profile/UrsulaHoffmann-Lange\" target=\"_blank\">Ursula Hoffmann-Lange</a>\r"
[3] "Authors: <a href=\"/profile/GabriellaIlonszki2\" target=\"_blank\">Gabriella Ilonszki</a>, <a href=\"/profile/GyorgyLengyel\" target=\"_blank\">Gyorgy Lengyel</a>\r"
[4] "Author: <a href=\"/profile/DavidBroul\" target=\"_blank\">David Broul</a>\r"
[5] "Author: <a href=\"/profile/JanPakulski\" target=\"_blank\">Jan Pakulski</a>"
Now we have only 5 lines that mention an author!
Let’s try to parse them using a loop
map(author_lines, function(x) {
html <- read_html(x)
author_urls <- html |>
html_elements("a") |>
html_attr("href")
authors <- html |>
html_elements("a") |>
html_text2()
list(author_urls = author_urls, authors = authors)
})
[[1]]
[[1]]$author_urls
[1] "/profile/HeinrichBest1"
[[1]]$authors
[1] "Heinrich Best"
[[2]]
[[2]]$author_urls
[1] "/profile/UrsulaHoffmann-Lange"
[[2]]$authors
[1] "Ursula Hoffmann-Lange"
[[3]]
[[3]]$author_urls
[1] "/profile/GabriellaIlonszki2" "/profile/GyorgyLengyel"
[[3]]$authors
[1] "Gabriella Ilonszki" "Gyorgy Lengyel"
[[4]]
[[4]]$author_urls
[1] "/profile/DavidBroul"
[[4]]$authors
[1] "David Broul"
[[5]]
[[5]]$author_urls
[1] "/profile/JanPakulski"
[[5]]$authors
[1] "Jan Pakulski"
This gives us a list with the right number of elements, while some elements contain several authors.
extract_papers <- function(html) {
html <- read_html(html)
paper_title <- html |>
html_elements("[href*='PaperDetails']") |>
html_text2()
authors <- html |>
as.character() |>
strsplit(split = "\n") |>
pluck(1) |>
str_subset("^Author")
profiles <- map(authors, function(x) {
# some author fields are empty, we need to check for that
if (nchar(x) <= 10L) {
return(list(author_urls = NA_character_, authors = NA_character_))
} else {
html <- read_html(x)
author_urls <- html |>
html_elements("a") |>
html_attr("href")
authors <- html |>
html_elements("a") |>
html_text2()
return(list(author_urls = author_urls, authors = authors))
}
})
tibble(paper_title, authors = map(profiles, "authors"), author_urls = map(profiles, "author_urls"))
}

ecpr_data_df |>
filter(Papers != "") |>
slice(2) |>
pull(Papers) |>
extract_papers() |>
unnest(c(authors, author_urls)) # unnest puts each author into its own row, duplicating paper titles where necessary
# A tibble: 6 × 3
paper_title authors author_urls
<chr> <chr> <chr>
1 Elites in times of epochal change. Empirical evidence and… Heinri… /profile/H…
2 Determinants of Cooperation Potentials in the German Bund… Ursula… /profile/U…
3 From elite consensus to asymmetric polarisation. Lessons … Gabrie… /profile/G…
4 From elite consensus to asymmetric polarisation. Lessons … Gyorgy… /profile/G…
5 Band of Brothers: Projection of Power Centralization in T… David … /profile/D…
6 The Rule of Populist Nationalists and Elite Decay in Pola… Jan Pa… /profile/J…
ecpr_data <- ecpr_data_df |>
filter(Papers != "") |>
mutate(papers = map(Papers, extract_papers))
ecpr_data |>
select(panel_id = ID, event_id = EventID, papers) |>
unnest(papers)
# A tibble: 2,276 × 5
panel_id event_id paper_title authors author_urls
<int> <int> <chr> <list> <list>
1 12045 214 Critique of Critical Policy Discourse … <chr> <chr [1]>
2 12045 214 Critique of Critical Policy Discourse … <chr> <chr [1]>
3 12045 214 An Editor Responds to Critique of Crit… <chr> <chr [1]>
4 12572 214 Elites in times of epochal change. Emp… <chr> <chr [1]>
5 12572 214 Determinants of Cooperation Potentials… <chr> <chr [1]>
6 12572 214 From elite consensus to asymmetric pol… <chr> <chr [2]>
7 12572 214 Band of Brothers: Projection of Power … <chr> <chr [1]>
8 12572 214 The Rule of Populist Nationalists and … <chr> <chr [1]>
9 12400 214 Does Political Polarization Affect Mig… <chr> <chr [1]>
10 12400 214 Beyond Party Identities: Mapping and A… <chr> <chr [1]>
# ℹ 2,266 more rows
ecpr_data_tidy <- ecpr_data |>
select(panel_id = ID, event_id = EventID, papers) |>
unnest(papers) |>
unnest(c(authors, author_urls)) |>
distinct(paper_title, authors, .keep_all = TRUE)
ecpr_data_tidy |>
count(authors, sort = TRUE)
# A tibble: 2,897 × 2
authors n
<chr> <int>
1 <NA> 7
2 Theodore Chadjipadelis 6
3 André Bächtiger 5
4 Arjan H. Schakel 5
5 Peter H. Feindt 5
6 Robert A. Huber 5
7 Sebastian Ludwicki-Ziegler 5
8 Aleh Cherp 4
9 Carolien van Ham 4
10 Carolina Galais 4
# ℹ 2,887 more rows
However:
library(pdftools)
comptext <- pdf_data("https://www.comptextconference.org/wp-content/uploads/2023/05/COMPTEXT-2023-programme-May-9-2023.pdf", font_info = TRUE)
comptext[[7]]
# A tibble: 345 × 8
width height x y space text font_name font_size
<int> <int> <int> <int> <lgl> <chr> <chr> <dbl>
1 46 12 63 63 TRUE Friday, WMKLBZ+SFBX1440 14.3
2 30 12 116 63 TRUE May WMKLBZ+SFBX1440 14.3
3 15 12 152 63 FALSE 12 WMKLBZ+SFBX1440 14.3
4 19 8 318 78 TRUE ‘The VQHLVV+SFTI1000 9.96
5 64 8 342 78 TRUE Representation VQHLVV+SFTI1000 9.96
6 8 8 410 78 TRUE of VQHLVV+SFTI1000 9.96
7 66 8 422 78 TRUE Gender-Related VQHLVV+SFTI1000 9.96
8 25 8 493 78 TRUE Issues VQHLVV+SFTI1000 9.96
9 8 8 522 78 FALSE in VQHLVV+SFTI1000 9.96
10 49 8 318 90 TRUE Politicians’ VQHLVV+SFTI1000 9.96
# ℹ 335 more rows
We see here that each word comes with its x/y position, dimensions, and font information.
Let’s investigate a few words we saw above:
# A tibble: 1 × 8
width height x y space text font_name font_size
<int> <int> <int> <int> <lgl> <chr> <chr> <dbl>
1 56 8 109 179 TRUE Intra-Party ZGRRLU+SFBX1000 9.96
# A tibble: 1 × 8
width height x y space text font_name font_size
<int> <int> <int> <int> <lgl> <chr> <chr> <dbl>
1 57 8 87 245 TRUE Politicization VQHLVV+SFTI1000 9.96
# A tibble: 2 × 8
width height x y space text font_name font_size
<int> <int> <int> <int> <lgl> <chr> <chr> <dbl>
1 38 8 63 353 TRUE Alberto ZGRRLU+SFBX1000 9.96
2 33 8 353 637 FALSE Alberto ZXOBMV+SFRM1000 9.96
# A tibble: 1 × 8
width height x y space text font_name font_size
<int> <int> <int> <int> <lgl> <chr> <chr> <dbl>
1 43 8 63 424 TRUE Agnieszka ZXOBMV+SFRM1000 9.96
# A tibble: 1 × 8
width height x y space text font_name font_size
<int> <int> <int> <int> <lgl> <chr> <chr> <dbl>
1 21 8 318 102 TRUE Yael ZGRRLU+SFBX1000 9.96
# A tibble: 1 × 8
width height x y space text font_name font_size
<int> <int> <int> <int> <lgl> <chr> <chr> <dbl>
1 32 8 244 109 FALSE Innova- ZXOBMV+SFRM1000 9.96
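Before mapping fonts to roles, it helps to count which font_name values occur on a page. A toy stand-in for one page of pdf_data() output (the font names are copied from the real output above; on the actual data the same count reveals the roles):

```r
library(dplyr)
library(tibble)

# A toy stand-in for one page of pdf_data() output
page <- tribble(
  ~text,          ~font_name,
  "Panel",        "ZGRRLU+SFBX1000", # bold: names/headers
  "1:",           "ZGRRLU+SFBX1000",
  "'The",         "VQHLVV+SFTI1000", # italic: paper titles
  "Politics...'", "VQHLVV+SFTI1000",
  "Alberto",      "ZXOBMV+SFRM1000"  # roman: body text
)

# One row per distinct font, most frequent first
page |>
  count(font_name, sort = TRUE)
```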
It looks like we can distinguish some types of information by the font_name and by the x and y position at which a word appears.
As a first step, we want the text in the left column to appear before the text in the right column. We can use the x position for this.
tidy_page <- function(page) {
left_column <- page |>
filter(x < 300) |>
group_by(y) |>
summarise(text = paste(text, collapse = " "),
font_name = list(unique(font_name)))
right_column <- page |>
filter(x > 300) |>
group_by(y) |>
summarise(text = paste(text, collapse = " "),
font_name = list(unique(font_name)))
bind_rows(left_column, right_column)
}

comptext_data <- comptext[7:14] |>
# bring the text in the correct order on each page
map(tidy_page) |>
bind_rows() |>
# paper titles use a specific font, which makes them easy to identify
mutate(paper_title = font_name == "VQHLVV+SFTI1000") |>
# lines with more than one font (bold and non-bold) are meta information, let's note that
mutate(font_name = map_chr(font_name, function(x) ifelse(length(x) > 1, "meta", x))) |>
# Panels start with the string panel and a large bold font
mutate(panel_start = font_name == "ZGRRLU+SFBX1000" & str_detect(text, "Panel")) |>
# using cumsum, we count the TRUE values. One appears at the start of each panel
mutate(panel = cumsum(panel_start)) |>
# we are not interested in the information before the first panel
filter(panel > 0) |>
group_by(panel) |>
# we can count the first appearance of a paper title to give the paper a unique id per panel
mutate(paper_nr = cumsum(paper_title & !lag(paper_title))) |>
# now we can group the data using this id to make sure the authors are matched to the right paper
group_by(panel, paper_nr) |>
# summarise takes several elements and uses a function to return exactly one per group
summarise(
# we select all text where the paper_nr == 0, which is the panel title
panel_title = paste(text[paper_nr == 0], collapse = " "),
# now we combine the text in the rows that have a paper title
paper_title = paste(text[paper_title], collapse = " "),
# finally, we put the authors in a list. They are the text within a panel description that uses one of two fonts
author = list(text[font_name %in% c("ZXOBMV+SFRM1000", "ZGRRLU+SFBX1000")]),
.groups = "drop"
) |>
# we replace empty rows with NAs so we can fill them with the preceding text
mutate(across(panel_title:paper_title, function(x) ifelse(x == "", NA_character_, x))) |>
fill(panel_title) |>
filter(paper_nr > 0) |>
unnest(author)
comptext_data
# A tibble: 259 × 5
panel paper_nr panel_title paper_title author
<int> <int> <chr> <chr> <chr>
1 1 1 Panel 1: Intra-Party Politics Location: Co… ‘The Polit… Álvar…
2 1 1 Panel 1: Intra-Party Politics Location: Co… ‘The Polit… Mathi…
3 1 1 Panel 1: Intra-Party Politics Location: Co… ‘The Polit… Alexa…
4 1 2 Panel 1: Intra-Party Politics Location: Co… ‘From MP t… Alber…
5 1 3 Panel 1: Intra-Party Politics Location: Co… ‘Ideologic… Huber…
6 1 3 Panel 1: Intra-Party Politics Location: Co… ‘Ideologic… Agnie…
7 1 3 Panel 1: Intra-Party Politics Location: Co… ‘Ideologic… Vikto…
8 2 1 Panel 2: Web Scraping Location: Conference… ‘Towards a… Alona…
9 2 1 Panel 2: Web Scraping Location: Conference… ‘Towards a… Lena …
10 2 1 Panel 2: Web Scraping Location: Conference… ‘Towards a… Will …
# ℹ 249 more rows
You have seen some tools and tricks to scrape websites now. But your best ally in web scraping is experience! Until tomorrow noon, your task is to find some conference you could imagine attending and scrape their website to get the same information we extracted today. Even if you don’t fully succeed, document the steps you take and note down where the information can be found. If you collect raw html in R and the data is not where it should be (e.g., the html elements containing panel names do not exist), you might have discovered a more advanced site. Note that down and try another conference.
Deadline: Thursday noon
Save some information about the session for reproducibility.
R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: EndeavourOS
Matrix products: default
BLAS: /usr/lib/libblas.so.3.11.0
LAPACK: /usr/lib/liblapack.so.3.11.0
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=nl_NL.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=nl_NL.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=nl_NL.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/Amsterdam
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] pdftools_3.3.3 jsonlite_1.8.7 lubridate_1.9.2 forcats_1.0.0
[5] stringr_1.5.0 dplyr_1.1.2 purrr_1.0.1 readr_2.1.4
[9] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2 tidyverse_2.0.0
[13] rvest_1.0.3
loaded via a namespace (and not attached):
[1] gtable_0.3.3 selectr_0.4-2 qpdf_1.3.2 compiler_4.3.1
[5] Rcpp_1.0.11 tidyselect_1.2.0 xml2_1.3.5 scales_1.2.1
[9] yaml_2.3.7 fastmap_1.1.1 R6_2.5.1 generics_0.1.3
[13] curl_5.0.1 knitr_1.43 munsell_0.5.0 pillar_1.9.0
[17] tzdb_0.4.0 rlang_1.1.1 utf8_1.2.3 stringi_1.7.12
[21] xfun_0.39 timechange_0.2.0 cli_3.6.1 withr_2.5.0
[25] magrittr_2.0.3 digest_0.6.33 grid_4.3.1 rstudioapi_0.15.0
[29] askpass_1.1 hms_1.1.3 lifecycle_1.0.3 vctrs_0.6.3
[33] evaluate_0.21 glue_1.6.2 codetools_0.2-19 fansi_1.0.4
[37] colorspace_2.1-0 rmarkdown_2.23 httr_1.4.6 tools_4.3.1
[41] pkgconfig_2.0.3 htmltools_0.5.5